Most scripting languages nowadays use regex pattern-matching libraries. Theseregex libraries borrow the syntax of regular expressions, but have an informalsemantics that is different from the semantics of regular expressions, removingthe commutativity of alternation and adding ad-hoc extensions that cannot beexpressed by formalisms for efficient recognition of regular languages, such asdeterministic finite automata. Parsing Expression Grammars are a formalism that can describe alldeterministic context-free languages and has a simple computational model. Inthis paper, we present a formalization of regexes via transformation to ParsingExpression Grammars. The proposed transformation easily accommodates several ofthe common regex extensions, giving a formal meaning to them. It also providesa clear computational model that helps to estimate the efficiency ofregex-based matchers, and a basis for specifying provably correct optimizationsfor them.
展开▼